This is the third case presented for the University of Texas at Austin Post Graduate Program in Artificial Intelligence and Machine Learning.
As always, and as good practice, the first step is importing the necessary libraries.
# Step one: importing the necessary packages
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
The dataset used is called bank-full. It comprises ~45k observations with client data and the outcome of a campaign focused on increasing subscriptions to term deposits.
#Step two: import the csv file
df = pd.read_csv('bank-full.csv')
The objective is to build a Machine Learning model, using Ensemble Techniques, to direct the bank's efforts toward the right clients and predict success in campaigns for the aforementioned product.
Therefore, in order to discover the best model and, more importantly, figure out the best strategic course of action for the bank, a number of techniques will be used to improve those chances.
#Acquiring a basic DataSet description
df.describe().T
df.info()
for feature in df.columns: # Loop through all columns in the dataframe
    if df[feature].dtype == 'object': # Only apply to columns holding categorical strings
        df[feature] = pd.Categorical(df[feature]) # Convert strings to the pandas Categorical dtype
df.head(10)
I like to use Pandas Profiling since it presents a comprehensive, full-scale analysis with little coding. From those results, I will highlight the main discoveries.
#Using pandas profiling for a more detailed understanding of the variables.
profile = ProfileReport(df, title='Pandas Profiling Report', html={'style':{'full_width':True}})
profile.to_notebook_iframe()
Despite not presenting missing values as such, there are a couple of strange things going on in the dataset.
The first is the variable pdays, with a very strong frequency at -1, which would literally mean the client was contacted -1 days ago. This does not seem to make sense.
The second, a little more subtle but with higher consequences, is poutcome, with a very high occurrence of "unknown" outcomes.
Those issues will be addressed later on, possibly even presented in different models for comparison.
For now, I'll simply finish the EDA with a couple of highlights.
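Before deciding how to treat the pdays anomaly, it helps to quantify how dominant the -1 value actually is. A minimal sketch, using a small toy frame as a stand-in for the real df (which loads from bank-full.csv):

```python
import pandas as pd

# Toy stand-in for the real df; the same two lines work on the full dataset.
toy = pd.DataFrame({'pdays': [-1, -1, -1, 10, 182, -1, 91]})

never = (toy['pdays'] == -1).sum()
print(f"{never} of {len(toy)} rows have pdays == -1")
```

On the real data this share is large enough that any treatment of -1 will noticeably shape the feature's distribution.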
#checking (highlighting) the presence of null values.
df.isnull().sum()
# Just some configs for a better Seaborn experience.
sns.set_context("poster")
sns.set_style('whitegrid')
#Verifying the outcomes of the campaign based on the remaining factors.
sns.pairplot(df, hue = "poutcome")
#Presenting the Correlation matrix for the variables.
plt.figure( figsize = (20,20))
corr = df.corr()
cmap = sns.diverging_palette(220, 10, as_cmap = True)
sns.heatmap(corr, cmap = cmap, square = True, linewidth = .5, annot = True)
The pairplot and the correlation matrix did not reveal anything that stands out, mainly in terms of feature importance: no single feature, by itself, gives us any insight into whether there will be a better or worse response.
As mentioned before, two major things caught my attention in the Dataset.
1) Variable pdays with negative values: it might mean the person will be contacted in the future, but from that alone we simply cannot infer anything.
2) Variable poutcome with a lot (~36k) of registered "unknown" values. This basically means a campaign was conducted with a certain client and we have no idea what happened next, either failure or success. My two cents: this is a great deal of noise to be taken care of. There is a high incidence of "other" as well, which may indicate, say, hiring a different service than the one offered. That is not the target either, but not necessarily bad.
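The distribution of poutcome categories can be checked with a single call. A hedged sketch on a toy frame standing in for the real df (on the full data the same call reveals the ~36k "unknown"s):

```python
import pandas as pd

# Toy stand-in for the real df's poutcome column.
toy = pd.DataFrame({'poutcome': ['unknown', 'failure', 'unknown', 'success',
                                 'other', 'unknown']})
counts = toy['poutcome'].value_counts()
print(counts)
```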
From this point on, my first task is to understand and determine the course of action.
Even so, I will create new objects and, by the end, run the "dirty" model for comparison.
#Converting negative pdays value into zero.
df.pdays.replace({-1:0}, inplace = True)
plt.figure( figsize = (20,20))
plt.xlim(-5, 400)
sns.distplot(df.pdays) # distplot is deprecated in newer seaborn; sns.histplot(df.pdays, kde=True) is the replacement
First, it is important to understand the nature of the missing data, since it could be Missing at Random (MAR), Missing Completely at Random (MCAR) or Missing Not at Random (MNAR): source (https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4).
The main distinction is that under MCAR the missingness is unrelated to any variable, while under MAR it is related to observed variables but not to the missing value itself; only under MNAR does the missingness depend on the unobserved value.
Since the unknown values seem somehow related to the time components of the campaign, it appears we are facing a MAR problem.
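One quick sanity check for the MAR hypothesis is to compare the target rate where poutcome is known versus "unknown": a systematic difference suggests the missingness is related to observed variables rather than purely random. A minimal sketch on toy data (column names assumed to match the real df):

```python
import pandas as pd

# Toy stand-in for the real df with the notebook's column names.
toy = pd.DataFrame({
    'poutcome': ['unknown', 'failure', 'unknown', 'success', 'unknown', 'other'],
    'Target':   ['no',      'no',      'no',      'yes',     'no',      'yes'],
})
missing = toy['poutcome'] == 'unknown'
rate_missing = (toy.loc[missing, 'Target'] == 'yes').mean()
rate_known = (toy.loc[~missing, 'Target'] == 'yes').mean()
print(rate_missing, rate_known)
```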
Still, the original dataset is kept as df and will remain so to be tested.
#Creating a new Dataset, excluding the "unknown" outcomes, which are pure noise.
known_outcomes = df.poutcome != "unknown"
df2 = df[known_outcomes]
df2.describe().T
df.describe().T
#Presenting the Correlation matrix for the variables.
plt.figure( figsize = (20,20))
corr = df2.corr()
cmap = sns.diverging_palette(220, 10, as_cmap = True) # anchor hues must lie in [0, 360]
sns.heatmap(corr, cmap = cmap, square = True, linewidth = .5, annot = True)
from sklearn.preprocessing import LabelEncoder
df3 = df2.copy()
lb_make = LabelEncoder()
# Encode each categorical column; fit_transform refits the encoder per column.
cat_cols = ['job', 'marital', 'education', 'month', 'poutcome',
            'contact', 'default', 'housing', 'loan', 'Target']
for col in cat_cols:
    df3[col] = lb_make.fit_transform(df2[col])
df3.head()
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
all_feats = ['job', 'marital', 'education', 'default', 'housing',
'loan', 'contact', 'month', 'age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']
X = df3[all_feats]
y = df3.Target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
import statsmodels.api as sm
# Note: no intercept is added here; wrapping X in sm.add_constant(X) would include one.
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary2())
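The logit summary is easier to read on the odds scale: exponentiating a coefficient (in the notebook, np.exp(result.params)) gives an odds ratio, the multiplicative change in the odds of subscription per one-unit increase in the feature. A minimal sketch with a made-up coefficient (the value below is illustrative, not from the fitted model):

```python
import numpy as np

beta_duration = 0.004  # hypothetical coefficient for call duration
odds_ratio = np.exp(beta_duration)
# Each one-unit increase multiplies the odds of subscribing by exp(beta),
# here roughly a 0.4% increase in the odds.
print(odds_ratio)
```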
from sklearn.tree import DecisionTreeClassifier
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
y_pred = dTree.predict(X_test)
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 6, random_state=1)
dTreeR.fit(X_train, y_train)
print(dTreeR.score(X_train, y_train))
print(dTreeR.score(X_test, y_test))
print (pd.DataFrame(dTreeR.feature_importances_, columns = ["Importance"], index = X_train.columns))
from sklearn import metrics
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
#Plotting the ROC curve
plt.figure( figsize = (20,10))
y_pred_proba = dTree.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.title('AUC dTree')
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
print(dTreeR.score(X_test , y_test))
y_predict = dTreeR.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = ["No", "Yes"], columns = ["No", "Yes"])
plt.figure(figsize = (10,10))
sns.heatmap(df_cm,cmap = cmap, annot=True ,fmt='g')
from sklearn.ensemble import BaggingClassifier
bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=50, random_state=1) # newer sklearn versions (>=1.2) rename base_estimator to estimator
#bgcl = BaggingClassifier(n_estimators=50,random_state=1)
bgcl = bgcl.fit(X_train, y_train)
y_predict = bgcl.predict(X_test)
print(bgcl.score(X_test , y_test))
cm = metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = ["No", "Yes"], columns = ["No", "Yes"])
plt.figure(figsize = (10,10))
sns.heatmap(df_cm, cmap = cmap, annot=True ,fmt='g')
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(n_estimators=10, random_state=1)
#abcl = AdaBoostClassifier( n_estimators=50,random_state=1)
abcl = abcl.fit(X_train, y_train)
y_predict = abcl.predict(X_test)
print(abcl.score(X_test , y_test))
cm = metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = ["No", "Yes"], columns = ["No", "Yes"])
plt.figure(figsize = (10,10))
sns.heatmap(df_cm, cmap = cmap, annot=True ,fmt='g')
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50,random_state=1)
gbcl = gbcl.fit(X_train, y_train)
y_predict = gbcl.predict(X_test)
print(gbcl.score(X_test, y_test))
cm = metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = ["No", "Yes"], columns = ["No", "Yes"])
plt.figure(figsize = (10,10))
sns.heatmap(df_cm, cmap = cmap, annot=True ,fmt='g')
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50, random_state=1,max_features=12)
rfcl = rfcl.fit(X_train, y_train)
y_predict = rfcl.predict(X_test)
print(rfcl.score(X_test, y_test))
cm = metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = ["No", "Yes"], columns = ["No", "Yes"])
plt.figure(figsize = (10,10))
sns.heatmap(df_cm,cmap = cmap, annot=True ,fmt='g')
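Finally, a hedged sketch of how the ensembles used above can be compared side by side. It fits the same model lineup on sklearn's toy breast-cancer data as a stand-in for df3 (the real notebook would reuse its own X_train/X_test splits), so the accuracies printed here are illustrative only:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)

# Toy stand-in data for the notebook's df3 features and Target.
Xt, yt = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(Xt, yt, test_size=0.3, random_state=42)

models = {
    'DecisionTree': DecisionTreeClassifier(random_state=1),
    'Bagging': BaggingClassifier(n_estimators=50, random_state=1),
    'AdaBoost': AdaBoostClassifier(n_estimators=50, random_state=1),
    'GradientBoosting': GradientBoostingClassifier(n_estimators=50, random_state=1),
    'RandomForest': RandomForestClassifier(n_estimators=50, random_state=1),
}
# Fit each model and tabulate test accuracy in one pass.
scores = {name: m.fit(Xtr, ytr).score(Xte, yte) for name, m in models.items()}
print(pd.Series(scores).sort_values(ascending=False))
```

Collecting the scores into one Series makes the final model choice explicit instead of scattering accuracies across print statements.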